A reference-free approach for cell type classification with scRNA-seq
Abstract
• Compressed k-mer groups (CKGs) are used to classify cell types without references
• CKGs are competitive with gene expression features for cell type classification
• CKGs are associated with genes sharing cell type specific k-mers

Single-cell RNA sequencing (scRNA-seq) has become a revolutionary technology to characterize cells under different biological conditions. Unlike bulk RNA-seq, gene expression from scRNA-seq is highly sparse due to limited sequencing depth per cell. This is worsened by tossing away a significant portion of reads that cannot be attributed to genes during quantification. To overcome data sparsity and fully utilize original reads, we propose scSimClassify, a reference-free and alignment-free approach to classify cell types with k-mer level features. The compressed k-mer groups (CKGs), identified by the simhash method, contain k-mers with similar abundance profiles and serve as the cells' features. Our experiments demonstrate that CKG features lend themselves to better performance than gene expression features in classification accuracy in the majority of experimental cases. Because CKGs are derived from raw reads without alignment to a reference genome, scSimClassify offers an effective alternative to existing methods, especially when the reference genome is incomplete or insufficient to represent subject genomes.

Cataloging cell types is crucial to understanding the organization of cells, disease mechanisms, and even treatment responses. scRNA-seq makes it possible to identify cell subpopulations by exploring the unique transcriptomic profile of each cell. Clustering is the most popularly used approach to partition cells based on transcriptome similarity in an unsupervised fashion (Andrews and Hemberg, 2018). However, this approach requires well-established knowledge of biomarkers for annotation, as well as of the cell populations. Unfortunately, such information is often unavailable prior to analysis (Kiselev et al., 2019). Therefore, researchers turn to other machine learning approaches, such as supervised classification, to annotate cells automatically (Abdelaal et al., 2019).

Recently, Abdelaal et al. (2019) benchmarked 22 classification methods for automatic cell identification. All of these approaches utilized the gene expression of individual cells. The study included many conventional classifiers, such as support vector machine (SVM) and random forest (RF), in addition to a few recently developed single cell-specific classifiers, including ACTINN (Ma and Pellegrini, 2020) and scPred (Alquicira-Hernandez et al., 2019), and demonstrated the efficacy of gene expression profile-based classification. In another study, Arvind Iyer et al. (Iyer et al., 2020) fitted naive Bayes, gradient boosting machine, and RF classifiers to recognize diverse cell phenotypes.

scRNA-seq data is notorious for its relatively low sequencing depth, resulting in sparse data across all cells (Yuan et al., 2017). To make things worse, read alignment filters out unmapped reads. It is not uncommon for about half of the reads to be thrown away before the final quantification (Vieth et al., 2019). Note that unmapped reads are not necessarily bad reads. Using standard reference genomes may eliminate reads representing variations particular to a subject, a cell type, or a genome. Last but not least, aligning reads to derive the gene-cell count matrix is typically the most time-consuming step in the process. To overcome these limitations, we develop an alignment-free approach, sidestepping read mapping (Zielezinski et al., 2019; Shi and Yip, 2019).

Specifically, scSimClassify explores novel features from the entirety of the reads. Instead of using gene expression, we use k-mers, referred to as genomic words, for classification. Intuitively, genomic words can be extracted from the reads of each sample. Each of the "words" has its own "frequency" or abundance, which is defined as the number of times the k-mer appears in the reads. A change in the expression of a gene/transcript will correspondingly affect the abundances of the k-mers identifying them. Thus k-mers have a strong association with genes/transcripts. The advantage here is that k-mers can be obtained easily without references. In the meantime, the k-mer set also captures subject-specific sequences that do not fit the reference. The challenge is the huge number of k-mers, hundreds of millions depending on sequencing depth. Such a large size is no blessing when it comes to achieving scalability.

We observe that many k-mers are expressed very similarly across samples, for example a group of k-mers from the same gene/transcript. These k-mers are redundant in the true feature space. Clustering is one popular approach to group similar objects (Jiang et al., 2004), but clustering methods are not feasible here because of the unknown number of clusters and the high computational cost of dealing with k-mers directly. Various methods have been developed in the past in the field of metagenomics to reduce k-mer features, but they are restricted to applications with only case and control experiments. In that case, k-mers that significantly differentiate the two groups were selected for further analysis (LaPierre et al., 2019; Wang et al., 2018). These methods cannot be applied beyond a two-group comparison. Oftentimes, cell type classification is a multi-class problem with a dozen or more cell types in an experiment.

In this paper, scSimClassify reduces the k-mer feature space by partitioning k-mers into subsets with a variety of abundance profiles via an unsupervised approach. This is achieved by repurposing the simhash method (Charikar, 2002), an extremely fast algorithm to detect similar items within a set. We evaluate scSimClassify on datasets generated from breast cancer tissues with tumor and immune cell populations, and from blood samples studying peripheral blood mononuclear cells (PBMCs) of COVID-19 and influenza patients. scSimClassify accurately classifies cell types with aggregated k-mer features (CKG features).
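The k-mer counting and simhash-based grouping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes simhash is realized with random Gaussian hyperplanes over each k-mer's abundance profile (one value per cell), that k-mers with identical fingerprints form one group, and that a group's feature value is the mean abundance of its members. The function names, the seed, and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count how many times each k-mer appears in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def simhash_fingerprint(profile, hyperplanes):
    """Return an n-bit integer: one sign bit per random hyperplane."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(w * x for w, x in zip(plane, profile))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def group_kmers(profiles, n_bits, seed=0):
    """Group k-mers whose abundance profiles share the same fingerprint."""
    rng = random.Random(seed)
    n_cells = len(next(iter(profiles.values())))
    # One random hyperplane (normal vector) per fingerprint bit.
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(n_cells)]
                   for _ in range(n_bits)]
    groups = defaultdict(list)
    for kmer, profile in profiles.items():
        groups[simhash_fingerprint(profile, hyperplanes)].append(kmer)
    return dict(groups)


def ckg_matrix(profiles, groups):
    """Aggregate each group into one feature row.

    Assumption: the aggregate is the mean abundance per cell.
    """
    matrix = {}
    for fingerprint, kmers in groups.items():
        columns = zip(*(profiles[kmer] for kmer in kmers))
        matrix[fingerprint] = [sum(col) / len(kmers) for col in columns]
    return matrix


# Toy example: "AAA" and "CCC" have proportional profiles across 3 cells,
# so every hyperplane puts them on the same side and they share a fingerprint.
profiles = {"AAA": [1.0, 2.0, 0.0], "CCC": [2.0, 4.0, 0.0], "GGG": [0.0, 0.0, 5.0]}
groups = group_kmers(profiles, n_bits=16, seed=1)
features = ckg_matrix(profiles, groups)
```

Because a fingerprint bit depends only on the sign of a dot product, profiles that differ by a positive scale factor always receive the same fingerprint, which is why the sketch groups k-mers by profile shape rather than by absolute abundance.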
We also find that top-ranked CKG features are biologically meaningful [...]. To the best of our knowledge, this is the first approach to classify cell types using k-mer level information [...]. Besides improving general classification accuracy, [...].

Figure 1 describes an overview of the training steps of scSimClassify. scSimClassify takes a real-valued k-mer abundance matrix as input. The matrix is assembled from the sequenced reads and preprocessed to filter unreliable k-mers. Here, we define [...] k-mers. With this matrix as input, a simhash-based group generator (simGG) is implemented in three steps: (1) generate the k-mers' n-bit fingerprints, (2) group the k-mers by fingerprint, and (3) determine the CKG feature matrix. Finally, scSimClassify uses the CKG features to train classifiers. A detailed description is provided in STAR Methods.

The goal of the experiment was to evaluate scSimClassify for cell type classification with scRNA-seq data. Two breast cancer datasets and two PBMC datasets (Table S1) were used in the evaluation. CKG features were compared with the commonly used gene expression features (referred to as GE in the following). We conducted thorough comparisons among numerous general-purpose classifiers (RF, GBM, MLP, and SVM) and between [...]. Benchmarking methods that assign cell identities, Abdelaal et al. concluded that [...] performed [...] classifiers.

Several combinations of k-mer length, k, and fingerprint size, n, were explored in this study. The k-mer length was either 16 or 21 (Dieffenbach et al., 1993), while the fingerprint size was either 16-bit or 32-bit. To investigate whether unmapped reads are essential, two categories of inputs were used, as listed in Table 1. The first category (on the left of Table 1) involved no reference-based selection, containing all reads in the data; the second category (on the right of Table 1) contained only reads mapped to the reference genome.

Table 1. The nomenclature (id) of each combination of parameter values

id           k    n    read type    id              k    n    read type
allk21n16    21   16   all          mappedk21n16    21   16   mapped
allk21n32    21   32   all          mappedk21n32    21   32   mapped
allk16n16    16   16   all          mappedk16n16    16   16   mapped
allk16n32    16   32   all          mappedk16n32    16   32   mapped

To account for imbalanced data, we calculated the F1 score with the metrics module of the scikit-learn library, in which the F1 score of each class is weighted by its contribution (Pedregosa et al., 2011). In the first experiment, we evaluated scSimClassify's performance with training and testing data from the same dataset, named the intra-dataset experiment. We made comparisons by reporting results in the following groups: (a) scSimClassify with each classifier (GBM, MLP, RF, and SVM), using GE and the 8 CKG feature sets (Table 1); (b) ACTINN and scPred with GE. We used stratified 5-fold cross-validation to select hyperparameters for scSimClassify. For all pipelines, five independent repetitions were used to report the results.

Two datasets, Chung's dataset (Chung et al., 2017) and PBMC3k (10x Genomics, 2016), were used to compare the classification ability of CKG and GE features. As reported in Table 2, the overall winner is SVM [...]. Models trained with CKG features quite consistently improve over [...], which supports the hypothesis that [...] annotation [...] sufficient for classification.

Table 2. Performance of cell type classification

(A) Chung's dataset

Classifier   Feature set     Accuracy       F1             # Features
MLP          GE              0.938±0.017    0.936±0.018    11353±42
MLP          allk21n16       0.941±0.023    0.939±0.026    5323±488
MLP          allk21n32       0.94±0.025     0.938±0.027    12939±204
MLP          allk16n16       0.942±0.023    0.939±0.026    6213±564
MLP          allk16n32       0.942±0.025    0.94±0.026     14191±218
MLP          mappedk21n16    0.935±0.023    0.933±0.026    5248±542
MLP          mappedk21n32    0.935±0.025    0.933±0.027    12334±472
MLP          mappedk16n16    0.932±0.023    0.929±0.025    6117±544
MLP          mappedk16n32    0.934±0.022    0.932±0.024    13443±222
RF           GE              0.916±0.022    0.906±0.027    11353±42
RF           allk21n16       0.916±0.024    0.906±0.028    5323±488
RF           allk21n32       0.926±0.021    0.92±0.025     12939±204
RF           allk16n16       0.895±0.029    0.881±0.035    6213±564
RF           allk16n32       0.921±0.021    0.914±0.025    14191±218
RF           mappedk21n16    0.915±0.026    0.906±0.031    5248±542
RF           mappedk21n32    0.931±0.016    0.926±0.02     12334±472
RF           mappedk16n16    0.892±0.025    0.877±0.031    6117±544
RF           mappedk16n32    0.914±0.024    0.905±0.028    13443±222
GBM          GE              0.925±0.019    0.92±0.022     11353±42
GBM          allk21n16       0.911±0.025    0.906±0.026    5323±488
GBM          allk21n32       0.923±0.02     0.917±0.023    12939±204
GBM          allk16n16       0.915±0.022    0.91±0.026     6213±564
GBM          allk16n32       0.922±0.025    0.918±0.028    14191±218
GBM          mappedk21n16    0.92±0.023     0.914±0.027    5248±542
GBM          mappedk21n32    0.928±0.017    0.923±0.02     12334±472
GBM          mappedk16n16    0.918±0.021    0.912±0.024    6117±544
GBM          mappedk16n32    0.918±0.02     0.912±0.023    13443±222
SVM          GE              0.94±0.017     0.938±0.018    11353±42
SVM          allk21n16       0.94±0.025     0.937±0.029    5323±488
SVM          allk21n32       0.931±0.02     0.928±0.022    12939±204
SVM          allk16n16       0.938±0.025    0.934±0.028    6213±564
SVM          allk16n32       0.936±0.023    0.933±0.025    14191±218
SVM          mappedk21n16    0.936±0.024    0.934±0.027    5248±542
SVM          mappedk21n32    0.93±0.021     0.927±0.024    12334±472
SVM          mappedk16n16    0.937±0.022    0.933±0.025    6117±544
SVM          mappedk16n32    0.934±0.023    0.931±0.026    13443±222
ACTINN       GE              0.906±0.024    0.9±0.026      24613±228
scPred       GE              0.896±0.025    0.919±0.024    38913

(B) PBMC3k dataset

Classifier   Feature set     Accuracy       F1             # Features
MLP          GE              0.87±0.014     0.866±0.015    16115±18
MLP          allk21n16       0.893±0.015    0.892±0.016    6691±364
MLP          allk21n32       0.892±0.014    0.892±0.014    8191±101
MLP          allk16n16       0.893±0.012    0.893±0.012    6299±353
MLP          allk16n32       0.891±0.013    0.891±0.013    7959±104
MLP          mappedk21n16    0.894±0.013    0.894±0.013    6645±372
MLP          mappedk21n32    0.89±0.013     0.89±0.013     8187±109
MLP          mappedk16n16    0.894±0.013    0.894±0.013    6275±352
MLP          mappedk16n32    0.893±0.012    0.892±0.012    7938±117
RF           GE              0.856±0.016    0.856±0.016    16115±18
RF           allk21n16       0.879±0.01     0.876±0.01     6691±364
RF           allk21n32       0.891±0.013    0.889±0.014    8191±101
RF           allk16n16       0.881±0.012    0.877±0.013    6299±353
RF           allk16n32       0.888±0.012    0.886±0.012    7959±104
RF           mappedk21n16    0.883±0.013    0.879±0.014    6645±372
RF           mappedk21n32    0.887±0.013    0.885±0.014    8187±109
RF           mappedk16n16    0.884±0.011    0.881±0.012    6275±352
RF           mappedk16n32    0.888±0.014    0.886±0.014    7938±117
GBM          GE              0.879±0.01     0.874±0.011    16115±18
GBM          allk21n16       0.887±0.012    0.885±0.012    6691±364
GBM          allk21n32       0.893±0.013    0.892±0.014    8191±101
GBM          allk16n16       0.887±0.011    0.884±0.011    6299±353
GBM          allk16n32       0.891±0.014    0.889±0.014    7959±104
GBM          mappedk21n16    0.889±0.013    0.887±0.014    6645±372
GBM          mappedk21n32    0.891±0.014    0.889±0.015    8187±109
GBM          mappedk16n16    0.889±0.013    0.888±0.013    6275±352
GBM          mappedk16n32    0.895±0.015    0.893±0.015    7938±117
SVM          GE              0.888±0.014    0.885±0.014    16115±18
SVM          allk21n16       0.895±0.014    0.894±0.014    6691±364
SVM          allk21n32       0.905±0.014    0.904±0.014    8191±101
SVM          allk16n16       0.894±0.013    0.894±0.014    6299±353
SVM          allk16n32       0.905±0.013    0.904±0.014    7959±104
SVM          mappedk21n16    0.894±0.015    0.894±0.015    6645±372
SVM          mappedk21n32    0.905±0.014    0.905±0.014    8187±109
SVM          mappedk16n16    0.897±0.013    0.896±0.013    6275±352
SVM          mappedk16n32    0.905±0.016    0.904±0.016    7938±117
ACTINN       GE              0.856±0.014    0.856±0.015    12477±28
scPred       GE              0.87±0.018     0.89±0.017     32738

Comparison of GE and CKG feature sets (listed in Table 1). The mean and standard deviation of the metrics were recorded after cross-validation.

To understand the effect of the parameters n and k on performance, we compared the CKG feature sets with each other. With the classifier, read type, and n fixed, 21-mer features outperform 16-mer features in 10 cases [...] both datasets. For example, the accuracy of allk21n16 is 2.1% higher than that of allk16n16 with RF on Chung's dataset. This indicates that longer k-mers lead to a finer resolution of k-mer diversity.
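The class-weighted F1 score used in the evaluation can be written out explicitly. The sketch below is a plain-Python illustration of the support-weighted average of per-class F1 scores, which is what scikit-learn's `f1_score(y_true, y_pred, average="weighted")` computes; the function name `weighted_f1` and the toy labels are hypothetical, and whether the authors used exactly this averaging option is an assumption based on the description above.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (n_pred + n_true)
        f1 = 2 * tp / (n_pred + n_true)
        # Each class contributes in proportion to its share of the true labels.
        score += (n_true / total) * f1
    return score


# Toy imbalanced example: three tumor cells ("T") and one B cell ("B").
y_true = ["T", "T", "T", "B"]
y_pred = ["T", "T", "B", "B"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.7667
```

Weighting by support keeps a rare cell type from dominating the score while still penalizing errors on it, which matters when cell populations are imbalanced.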
This should ultimately result in [...]. With the classifier, read type, and k fixed, features grouped by 32-bit fingerprints [...] 16-bit fingerprints in around two-thirds of the cases. Theoretically, more hyperplanes separate the k-mer space more finely, thus creating a more precise categorization of fingerprints. This eventually leads to more descriptive features and better performances. [...] in three-quarters of the cases, which suggests that scSimClassify is able to capture relevant features from all reads, and that preselecting mapped reads does not contribute additional gain. Based on these results, allk21n32 [...].

Inferring highly variable genes is a common practice in current bioinformatics analysis (Brennecke et al., 2013). To examine the necessity of feature selection, we selected the top 2000 variable features with the default settings of Seurat VST (Butler et al., 2018) from the GE and CKG (allk21n32) feature sets. Classification results for the intra-datasets are shown in Table S2. Comparing Table 2 and Table S2, there is no clear winner: the effect is classifier-dependent and dataset-dependent. Moreover, inferring a subset of variable features may exclude discriminant features, and other sources of variation introduce additional selection parameters. Therefore [...].

In the inter-dataset experiment, models trained on one dataset were used to predict the shared cell types in another. For breast cancer, the model [...] Karaayvaz's dataset (Karaayvaz et al., 2018); both consist of breast cancer tissues. The PBMC sets [...] COVID-19 patient, FLU patient, and healthy donor samples from Lee's dataset (Lee et al., 2020). To obtain optimal hyperparameters for the target distribution, we randomly chose 20% and 80% of the target data for validation and testing, respectively, and selected hyperparameters by grid search (Feurer and Hutter, 2019). We used the well-performing feature set, allk21n32, suggested by the intra-dataset experiment. Table 3 represents the results averaged over 5 repetitions. Overall, SVM with CKG features shows the highest accuracy in detecting [...], followed by GBM [...]. Again, CKG features show [...] in almost all metrics. [...] outperforms [...] failed the task [...] tuning [...] pronounced [...] intra-data configuration.

Table 3. Performance of [...] datasets

Classifier   Feature set    Accuracy       F1             # Features
MLP          GE             0.69±0.045     0.69±0.045     11381
MLP          allk21n32      0.764±0.039    0.764±0.039    12958
RF           GE             0.803±0.007    0.803±0.007    11381
RF           allk21n32      0.828±0.003    0.828±0.003    12958
GBM          GE             0.842±0.005    0.842±0.005    11381
GBM          allk21n32      0.828±0.002    0.828±0.002    12958
SVM          GE             0.692±0.007    0.692±0.007    11381
SVM          allk21n32      0.872±0.003    0.872±0.003    12958
ACTINN       GE             0.838±0.028    0.852±0.023    17061±36

Comparison of GE and CKG (allk21n32) features with models trained on Chung's dataset and tested on Karaayvaz's dataset.

[...] (Figure 2),
Journal
Journal title: iScience
Year: 2021
ISSN: 2589-0042
DOI: https://doi.org/10.1016/j.isci.2021.102855